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Abstract: Clustering analysis is one of the most widely used statistical 
tools in many emerging areas such as microarray data analysis. For microar- 
ray and other high-dimensional data, the presence of many noise variables 
may mask underlying clustering structures. Hence removing noise variables 
via variable selection is necessary. For simultaneous variable selection and 
parameter estimation, existing penalized likelihood approaches in model- 
based clustering analysis all assume a common diagonal covariance matrix 
across clusters, which however may not hold in practice. To analyze high- 
dimensional data, particularly those with relatively low sample sizes, this 
article introduces a novel approach that shrinks the variances together with 
means, in a more general situation with cluster-specific (diagonal) covari- 
ance matrices. Furthermore, selection of grouped variables via inclusion 
or exclusion of a group of variables altogether is permitted by a specific 
form of penalty, which facilitates incorporating subject-matter knowledge, 
such as gene functions in clustering microarray samples for disease subtype 
discovery. For implementation, EM algorithms are derived for parameter es- 
timation, in which the M-steps clearly demonstrate the effects of shrinkage 
and thresholding. Numerical examples, including an application to acute 
leukemia subtype discovery with microarray gene expression data, are pro- 
vided to demonstrate the utility and advantage of the proposed method. 
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1. Introduction 

Clustering analysis is perhaps the most widely used analysis method for mi- 
croarray data: it has been used for gene function discovery (Eisen et al. 1998 
[10]) and cancer subtype discovery (Golub et al. 1999 [1-^)]). In such an apphca- 
tion involving a large number of genes arrayed, it is necessary but challenging 
to choose a set of informative genes for clustering. If some informative ones are 
excluded because fewer genes are used, then it becomes difficult or impossible to 
discriminate some phenotypes of interest such as cancer subtypes. On the other 
hand, using redundant genes introduces noise, leading to the failure to uncover 
the underlying clustering structure. For example, Alaiya et al. (2002) [1] con- 
sidered borderline ovarian tumor classification via clustering protein expression 
profiles: using all 1584 protein spots on an array failed to achieve an accurate 
classification, while an appropriate selection of spots (based on discriminating 
between benign and malignant tumors) did give biologically more meaningful 
results. 

In spite of its importance, it is not always clear how to select genes for clus- 
tering. In particular, as demonstrated by Pan and Shen (2007) [39] and Pan et 
al. (2006) [.37], unlike in the context of supervised learning, including regression, 
best subset selection, one of the most widely used model selection methods for 
supervised learning, fails for clustering and semi-supervised learning, in addition 
to its prohibitive computing cost for high-dimensional data; the reason is the 
existence of many correct models, most of which are not of interest. In a review 
of the earlier literature on this problem, Gnanadesikan et al. (1995) [14] com- 
mented that "One of the thorniest aspects of cluster analysis continue to be the 
weighting and selection of variables" . More recently, Raftery and Dean (2006) 
[41] pointed out that "Less work has been done on variable selection for clus- 
tering than for classification (also called discrimination or supervised learning), 
perhaps reflecting the fact that the former is a harder problem. In particular, 
variable selection and dimension reduction in the context of model-based clus- 
tering have not received much attention" . For variable selection in model-based 
clustering, most of the recent researches fall into two lines: one is Bayesian (Liu 
et al. 2003 [30]; Hoff 2006 [18]; Tadesse et al. 2005 [43]; Kim et al. 2006 [2-5]), 
while the other is penalized likelihood (Pan and Shen 2007 [39]; Xie et al. 2008 
[52]; Wang and Zhu 2008 ["')]). The basic statistical models of these approaches 
are all the same: informative variables are assumed to come from a mixture of 
Normals, corresponding to clusters, while noise variables coming from a single 
Normal distribution; they differ in how they are implemented. In particular, 
the Bayesian approaches are more flexible than the penalized methods (because 
the latter all require a common diagonal covariance matrix, though our main 
goal here is to relax this assumption), but they are also computationally more 
demanding because of their use of MCMC for stochastic search; furthermore, 
penalized methods enjoy the flexibility of the use of penalty functions, such as to 
accommodate grouped parameters or variables as to be discussed later. Other 
recent efforts include the following: Raftery and Dean (2006) [41] considered 
a sequential, stepwise approach to variable selection in model-based clustering; 



B. Xie et al. /Penalized model-based clustering 



171 



however, as acknowledged by the authors, "when the number of variables is vast 
(e.g., in microarray data analysis when thousands of genes may be the variables 
being used), the method is too slow to be practical as it stands". Friedman 
and Meulman (2004) [ I I] dealt with a more general problem: selecting possibly 
different subsets of variables and their associated weights for different clusters 
for non- model-based clustering; Hoff (2004) [17] pointed out that the method 
might only "pick up the change in variance but not the mean" , and advocated 
the use of his model-based approach (Hoff 2006 [18]). Mangasarian and Wild 
(2004) [32] proposed the use of Li penalty for K-median clustering; the idea 
with the use of Li penalty is similar to ours, but we consider a more general 
case with cluster-specific variance parameters. 

The penalized methods proposed so far for simultaneous variable selection 
and model fitting in model-based clustering all assume a common diagonal 
covariance matrix. For high-dimensional data, it may be necessary to utilize 
a diagonal covariance matrix for model-based clustering; even for supervised 
learning, it has been shown that using a diagonal covariance matrix in naive 
Bayes discriminant analysis or its variants is more effective than that of a more 
general covariance matrix (Bickel and Levina 2004 [5]; Dudoit et al. 2002 [7]; 
Tibshirani et al. 2003 [47]). Hence we will restrict our discussion to a diagonal 
covariance matrix in what follows. On the other hand, the common (diagonal) 
covariance matrix assumption implies that the clusters all have the same size, 
as in the K-means method (which further assumes that all the clusters are 
sphere-shaped with a scaled identity matrix as the covariance). Of course, this 
assumption may be violated in practice. A general argument is the following: it 
is well known that the variance of gene expression levels is in general a function 
of the mean expression levels, suggesting possibly varying variances of a gene's 
expression levels across clusters with different mean expression levels; this point 
is going to be verified for our real data example. Here we extend the method to 
allow for cluster-dependent (diagonal) covariance matrices, which is nontrivial 
and requires a suitable construction of penalty function. 

In some applications, there is prior knowledge about grouping variables: some 
variables function as a group; either all of them or none of them is informative. 
Yuan and Lin (2006) [54] discussed this issue in the context of penalized regres- 
sion; they demonstrated convincingly the efficiency gain from incorporating such 
prior knowledge. On the other hand, in genomic studies of clustering samples 
through gene expression profiles, it is known that genes function in groups as 
in biological pathways. Hence, rather than treating genes individually, it seems 
natural to apply biological knowledge on gene functions to group the genes 
accordingly in clustering microarray samples, which has not been considered 
in previous applications of model-based clustering of expression profiles (e.g. 
Ghosh and Chinnaiyan 2002 [13]; Li and Hong 2001 [27]; McLachlan et al. 2002 
[3-3]; Yeung et al. 2001 [53]). Note that, a few existing works clustered genes by 
incorporating gene function annotations in a weaker form that did not require 
either all or none of a group of genes to appear in a final model: Pan (2006b) [38] 
treated the genes within the same functional group as sharing the same prior 
probability of being in a cluster, while genes from different groups might not 
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have the same prior probabihty, in model-based clustering of genes; others took 
account of gene groups in the definition of a distance metric in other clustering 
methods (Huang and Pan 2006 [20]). In addition, the aforementioned clustering 
methods did not allow for variable selection directly, while it is our main aim 
to consider variable selection, possibly assisted with biological knowledge. This 
is in line with the currently increasing interest in incorporating biological in- 
formation on gene functional groups into analysis of detecting differential gene 
expression (e.g. Pan 2006 [37]; Efron and Tibshirani 2007 [s]; Newton et al. 2007 



The rest of this article is organized as follows. Section 2 briefly reviews the 
penalized model-based clustering method with a common diagonal covariance, 
followed by our proposed methods that allow for cluster-specific diagonal co- 
variance matrices and for grouped variables. The EM algorithms for implement- 
ing the proposed methods are also detailed, in which the M-steps characterize 
the penalized mean and variance estimators with clear effects of shrinkage and 
thresholding. Simulation results in section 3 and an application to real microar- 
ray data in section 4 illustrate the utility of the new methods and their advan- 
tages over other methods. Section 5 presents a summary and a short discussion 
on future work. 

2. Methods 

2.1. Mixture model and its penalized likelihood 

We have /C-dimensional observations Xj,j = 1, . . . ,7i. It is assumed that the 
data have been standardized to have sample mean and sample variance 1 
across the n observations for each variable. The observations are assumed to be 
(marginally) iid from a mixture distribution with g components: '^ifi{xj',(^i): 
where 6i is an unknown parameter vector of the distribution for component i 
while TTi is a prior probability for component i. To obtain the maximum penal- 
ized likelihood estimator (MPLE), we maximize the penalized log-likelihood 



where Q represents all unknown parameters and pa(0) is a penalty with regular- 
ization parameter A. The specific form of Pa(0) depends on the aim of analysis. 
For variable selection, the Li penalty is adopted as in the Lasso (Tibshirani 



Denote by Zij the indicator of whether Xj is from component i; that is, Zy = 
1 if Xj is indeed from component i, and Zy = otherwise. Because we do 
not observe which component an observation comes from, z^-'s are regarded as 
missing data. If z^ 's could be observed, then the log-penalized-likelihood for 
complete data is: 



[3G]). 



n g 



logLp(e) = ;^iog Y,TT,f,{xf,e,) -px{e) 



1996 [4G]). 




(1) 
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Let X = {xj : j = 1, . . . ,n} represent the observed data. To maximize logip, 
the EM algorithm is often used (Dempster et al. 1977 [(>]). The E-step of the 
EM calculates 

Qp(e;eW) = Eeo, {log L,^p\X) = ^ ^ r^^ [log^, + log ; 0,)] - PA(e), 

i j 

(2) 

while the M-step maximizes Qp to update estimated O. In the sequel, because 
Tij 's always depend on r, for simplicity we may suppress the explicit dependence 
from the notation. 



2.2. Penalized clustering with a common covariance matrix 

Pan and Shen (2007) [V-)] specified each component fi as a Normal distribution 
with a common diagonal covariance structure V: 

= (27r)^/Vl^/' ^"^^ i^^^"" ' l^i)'y'\^ - ^^^) 

where V = diag{aj, erf, ... , (jj^), and \V\ = IIaLi ^fc- They proposed a penalty 
function px{Q) with an Li norm involving the mean parameters alone: 

a K 

PA(e) = Ai^^lM..|, (3) 

1=1 fe=l 

where & are the components of /i^, the mean of cluster i. Note that, because 
the data have been standardized to have sample mean and variance 1 for each 
variable k, if = ^gi^ = 0, then variable k is noninformative in terms 

of clustering and can be considered as a noise variable and excluded from the 
clustering analysis. The Li penalty function used in (3) can effectively shrink a 
small ^.ik to be exactly 0. 

For completeness and to compare with the proposed methods, we list the EM 
updates to maximize the above penalized likelihood (Pan and Shen 2007 [39]). 
We use a generic notation G^''-' to represent the parameter estimate at iteration 
r. For the posterior probability of Xj 's coming from component i, we have 

for the prior probability of an observation from the i*'' component /i, 

n 

for the variance for variable fe, 

-^'^^^' = EE^^(-.^-Ali^)Vn, (6) 
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and for the mean for variable k in cluster i 

-ir+l) 



^Ylii=\^ii \ V}li]=\^ij ^jfcl 



with i = 1, 2, . . . , g and fc = 1,2,..., /i . Evidently, we have jiik = if Ai is 
large enough. As discussed earlier, if /in, = /i2fc = • • • = /igj. = for variable A:, 
variable /c is a noise variable that does not contribute to clustering. 



2.3. Penalized clustering with unequal covariance matrices 

To allow varying cluster sizes, we consider a more general model with cluster- 
dependent diagonal covariance matrices: 

/'(^^ = (2^)Jf/2|v-|i/2 (-^(^ - f'^yv^'i^ - /^O) (8) 

where Vi = diag[af^,a'i2, crf^), and \V,\ = nf=i (^^k■ 

As discussed earlier, to realize variable selection, we require that a noise 
variable have a common mean and a common variance across clusters. Hence, 
the penalty has to involve both the mean and variance parameters. We shall 
penalize the mean parameters in the same way as before, while the variance 
parameters can be regularized in two ways: shrink afj. towards 1, or shrink 
log crff^ towards 0. 



2.3.1. Regularization of variance parameters: scheme one 

First, we will use the following penalty for both mean and variance parameters: 

a K g K 

Px{Q) = Al ^ ^ + A2 £ E l'^''^- - (9) 

i=l k=l i=l k=l 

Again the Li norm is used to coerce a small estimate of fiik to be exactly 0, 
while forcing an estimate of a^/, that is close to 1 to be exactly 1. Therefore, 
if a variable has common mean and common variance 1 across clusters, this 
variable is effectively treated as a noise variable; this aspect is evidenced from 
(4), where a noise variable does not contribute to the posterior probability and 
thus clustering. 

Note that penalty (9) differs from the so-called double penalization in non- 
parametric mixed-effect models for longitudinal and other correlated data (Lin 
and Zhang 1999 [29]; Gu and Ma 2005 [16]): aside from the obvious differences 
in the choice of the Li-norm here versus the L2-norm there and in clustering 
here versus regression there, they penalized fixed- and random-effect parameters, 
both mean parameters, whereas we regularize variance parameters in addition 
to mean parameters. Ma et al. (2006) [31] apphed such a mixed-effect model 
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to cluster genes with time course (and thus correlated) expression profiles; in 
addition to the aforementioned differences, a key difference is that their use of 
penalization was for parameter estimation, not for variable selection as aimed 
here. 

An EM algorithm is derived as follows. The E-step gives Qp as shown in (2). 
The M-step maximizes Qp with respect to the unknown parameters, resulting in 
the same updating formulas for and tt^ as given in (4) and (5). In Appendix 
B, we derive the following theorem: 

Theorem 1. The sufficient and necessary conditions for puk to he a (global) 
maximizer of Qp are 



En 
j=l '''ijXjk 



En 
.7 = 1 ' 



+ lAifel sign{jlik), if and only if fiik ^ (10) 



/(^ik < Al, if and only if fiik = 0, 



resulting in a slightly changed formula for the mean parameters 



(11) 



En 
J = l 



T-y- ^Jk 



j=l tj 



1 



(12) 



For the variance parameters, some algebra yields the following theorem: 

Theorem 2. The necessary conditions for afj, to be a local maximizer of Qp 
are 

\2 ^ 



1 



X2sign{a^,^~l), if a-,^ ^ 1 



(13) 



and 



{Xjk - hkf 



1. 



(14) 



Although a sufficient condition for afj^ = 1 can be derived as a special case 
of Theorem 5, we do not have any sufficient condition for af^. ^ 1. Hence, we 
do not have a simple formula to update af^^ . Below we characterize the solution 
af)., suggesting a computational algorithm as well as illustrating the effects of 
shrinkage. 

Let aik = A2sign((Tfj,-l), b, = X]?=i '^1^/2, and Cik = J2l=i ^iji^. 



■jk ~ t^ik 



)V2, 



then (13) reduces to aikof^. + feicr^c - Cik = 0, while (14) becomes \bi - Cik\ < A2. 
Note that afi^ ~ Cik/bi is the usual MLE when A2 = 0. It is easy to verify that 
if afi^ = 1, then (t|j. = 1. Below we consider cases with A2 > and a-f/^ ^ 1. It is 
shown in Appendix B that 
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1. if \h - Qfel > A2, 



1 , 1 I SlgIl(cit-b,)A2C.t 

2 + \/ 4 fc2 



(15) 



is the unique maximizer of Qp and is between 1 and aff. , 
2. if |6i - Cik\ < A2, 

(a) if (t|j, > 1, then afj^ = 1 is the unique maximizer; 

(b) if aff, < 1, i) if 6i — Cik < X2, then a^f. = 1 is a local maximizer; there 
may exist another local maximizer between a'^f. and 1; between the 
two, the one maximizing Qp is chosen; ii) if bi — Cik — X2, then cither 

= c.fc/A2 e {al, 1) (if < 1/2) or al = 1 (if af, > 1/2) is the 
unique maximizer. 

Naturally the above formulas suggest an updating algorithm for al. Clearly, al 
has been shrunk towards 1 . and can be exactly 1 if. for example, A2 is sufficiently 
large. 



2.3.2. Regularization of variance parameters: scheme two 

The following penalty is adopted for both mean and variance parameters: 

g K g K 

PA(e) = Al ^ ^ l/x., I + A2 ^ ^ I log al I . (16) 

i=i fe=i i=i k=i 

Note that the only difference between (9) and (16) is the penalty of the variance 
parameters, where \al — 1| is replaced by |logcr?;.|, which is used to shrink 
logcrl^ to (i.e. al to 1) if log cr^; is close to 0. Therefore, variable selection can 
be realized as before. 

An EM algorithm for the variance parameters is derived as follows. 

Theorem 3. The sufficient and necessary conditions for al to be a local max- 
imizer of Qp are 

( 1 {xjk-f^ik? \ \2sign{\ogal) ^ 

j—l \ ik ik / ik 

and 

^ ( 1 , {Xjk - fitk f \ 

+ 2 j 

j=i \ / 

If we denote b, = X^J^i and c^k = Yl'j=i '^iji^jk - fi'ikY/'^, then (17) 

reduces to al = al{l + A2sign(log(Tj^j;,)/6j), while (18) becomes \bi ~ Cik\ < A2, 
where al = Cik/bi is the usual MLE for A2 = 0. Derivations in Appendix B 



<A2, tfal = l. (18) 
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imply that sign(log(Tj^j,) 
cases, we obtain 



sign (log a^;.) = sign(cifc — bi). Combining the two 



1 + A2sign(Qfe - b^)/b. 



Cik\ 



\2)+ + 1 (19) 



The above formula suggests an updating algorithm for afj. . When A2 is small, 



sign(|6i - Qfel - A2)+ 
sufficiently large, sign(|6; 



1, afj. has been shrunk from af^, towards 1; when A2 is 



Cik \ — A2)+ = 0, then af^. is exactly 1 



2.4- Using adaptive penalization to reduce bias 

We can adopt the idea of adaptive penahzation, as proposed by Zou (2006) [57] 
for regression, in the present context. Following Pan et al. (2006) [40], we use a 
weighted Li penalty function 

g K g K 

PA (0) = Al ^ ^ I I + A2 ^ ^ [a^fc - 1 1 , (20) 

1=1 k=l 1=1 k=l 

where Wik = and Vik = ^/\(Jik ~ Ij"" with w > 0, and fiik and af)^ are 

the MPLE obtained in section 2.3.1; we also tried the usual MLE in Wik and 
Vik, but it did not work well in simulations, hence we skip its discussion; we only 
consider w = 1. The EM updates are slightly modified for the purpose: we only 
need to replace Ai and A2 by XiWik and X2Vik respectively, while keeping other 
updates unchanged. 

The main idea of adaptive penalization is to reduce the bias of the MPLE 
associated with the standard Li penalty: as can be seen clearly, if an initial 
estimate \flik\ is larger, then the resulting estimate is shrunk less towards 0; 
similarly for the variance parameter. 



2.5. Penalized clustering with grouped variables 

Now we consider a situation where candidate variables can be grouped based on 
the prior belief that either all the variables in the same group or none of them 
are informative to clustering. Following the same idea of the grouped Lasso of 
Yuan and Lin (2006) [-54], we construct a penalty for this purpose here. 

Suppose that the variables are partitioned into M groups with the cor- 
responding mean parameters fit = (/ia, Mj2, ■ • ■ , Mi-R")' = (/^i 'j mI': ■ • ■ j /^f^')'; 
dim(/i™) = km, and X]m=i = Accordingly, we decompose xj = (xj/, x^/, 
. . . , xf'/)/ and Vi = diag{afj^,af2, • ■ • , crfj^) = diag{Vii,Vt2, Vim) with Vim 
diagonal matrix, and cr,f is the column vector containing the 
diagonal elements of matrix Vim ■ 

For grouping mean parameters, we will use a penalty p\{Q) containing 

g M 

i—l m—1 
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for the mean parameters, where ||v|| is the L2 norm of vector v. Accordingly, 
we use 

g M 

i—1 m— 1 

as a penalty for grouped variance parameters. Note that we do not have to 
group both means and variances at the same time. For instance, we may group 
only means but not variances: we will thus use the second term in (9) as the 
penalty for variance parameters while retaining the above penalty for grouped 
mean parameters. 

The E-step of the EM yields Qp with the same form as (2). Next we derive 
the updating formulas for the mean and variance parameters in the M-step. 



2.5.1. Grouping mean parameters 



If the penalty for grouped means is used, we have the following result. 



Theorem 4. The sufficient and necessary conditions for fi ~ ip-"^), i = 1, 2, . . . , g 

M to he a unique maximizer of Qp are 



and m ~ 1,2,..., 



im I / J 

and 



yielding 



where = 
usual MLE. 



-, if and only if 7^ 0, 



< Al v km, if and only if fi" 



sign 1 — 



^1=1 



, and A™ = J2]'^^ njX'f/YTj^i n: 



(21) 
(22) 

(23) 

s the 



It is clear that, due to thresholding, /i™ = when, for example, Ai is suffi- 
ciently large. Noting that i/™ depends on /i™, we use (23) iteratively to update 



2.5.2. Grouping variance parameters 



If the penalty for grouped variances is used, we have the following theorem: 
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Theorem 5 

and m 



1,2, 
1 



The sufficient and necessary conditions for a. 
. . . , M , to be a local maximizer of Qp are 



2 



1, i = 1,2, 



< Xo 



< A, 



i=i 



1 



,771 \ 2 



< 0; 



otherwise. 



The necessary condition for a: 



2 

1 



^ 1 to 6e a Zoca^ maximizer of Qp is 
X-iVk^icrf - 1) 



2a. 



2 



2(^LJ^ 



(24) 



(25) 



It is clear that af„^ ~ 1 when, for example. A2 is large enough. It is also 
easy to verify that (24) and (25) reduce to the same ones for non-grouped 
variables when km — 1. To solve (24) and (25), we develop the following 
algorithm. Let a™ = A2 V^/||crj^„j - 1|| , bi = iJ2'i=i'''v f^)^ ^^^^d Cim = 



Mr 



'/2. Consider any fc'th component aff., oi af correspond- 



ingly, bik' and Cimk' are the k'th components of bj and Cim, respectively. In 
Appendix B, treating aim as a constant (i.e. by plugging-in a current estimate 
of ct^^to), we show the following cases, i) If af/^, = 1, then af/., = 1 is a maximizer 
of Qp as other cr^c'^ 7^ ^' fixed, ii) If afj^, = Cimk' /bit' > 1, there exists 

only one real root satisfying a^j., € {1,(7^^,); a bisection search can be used to 
find the root, iii) If af/., ~ Cimk' /bik' < 1, the real roots must be inside (tr^fc/, 1), 
hence a bisection search can be used to find a root; once a root is obtained, the 
other two real roots, if exist, can be obtained through a closed-form expression; 
we choose the real root that maximizes Q p (while other aff, for k ^ k' are fixed 
at their current estimates) as the new estimate of uf^, . After cycling through all 
k' , we update aim with the new estimate. Then the above process is iterated. 



2.5.3. Other grouping schemes 

To save space, we briefly discuss grouping variables under a common diagonal 
covariance matrix, for which only mean parameters need to be regularized. The 
EM updating formula for the mean parameters remains the same as in (23) 
except that the cluster-specific covariance Vim there is replaced by a common 
Vm\ updating formulas for the other parameters remain unchanged. Simulation 
results (see Xie et al. (2008) [52]) demonstrated its improved performance over 
its counterpart without grouping. In addition, we can also group the mean 
parameters for the same variable (or gene) across clusters (Wang and Zhu 2008 
[50]), and combine it with grouping variables (Xie et al. 2008 [52]). 

The grouping scheme discussed so far follows the grouped Lasso of Yuan and 
Lin (2006) [54], which is a special case of the Composite Absolute Penalties 
(CAP) of Zhao et al. (2006) [56]. In Appendix A, we derive the results with the 
CAP, including using both schemes on regularizing the variance parameters. 
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2.6. Model selection 

To introduce penalization, following Pan and Shen (2007) [39] and Pan et al. 
(2006) [40], we propose a modified BIG as the model selection criterion: 

BIC = -2 log L(e) + log(n)4 

where = g + K + gK — I — q is the effective number of parameters with 
q = 4t^{(i,k) : jjin^ — 0,(7?;, ~ 1}. The definition of de follows from that in Li- 
penalized regression (Efron et al. 2004 [')]; Zou et al. 2004 [V)]). This modified 
BIC is used to select the number of clusters g and the penalization parameters 
(Al , A2) jointly. We propose using a grid search to estimate the optimal [g, Ai , A2) 
as the one with the minimum BIC. 

For any given ((;,Ai,A2), because of possibly many local maxima for the 
mixture model, we run an EM algorithm multiple times with random starts. 
For our numerical examples, we randomly started the K- means and used the 
K-means' results as an initialization for the EM. From the multiple runs, we 
selected the one giving the maximal penalized log-likelihood as the final result 
for the given {g, Ai, A2). 

3. Simulations 

3.1. A common covariance versus unequal covariances 
3.1.1. Case I 

We first considered four simple set-ups: the first was a null case with g = 1; 
for the other three, g — 2, corresponding to only mean, only variance, and 
both mean and variance differences between the two clusters. Specifically, we 
generated 100 simulated datasets for each set-up. In each dataset, there were 
n = 100 observations, each of which contained K ~ 300 variables. Set-up 1) is 
a null case: all the variables had a standard normal distribution iV(0, 1), thus 
there was only a single cluster. For each of the other three set-ups, there were 
two clusters. One cluster contained 80 observations while the other contained 
20; while 279 variables were noises distributed as A''(0, 1), the other 21 variables 
were informative: each of the 21 variables were distributed as 2) N{0, 1) in 
cluster 1 versus 7V(1.5,1) in cluster 2; 3) iV(0, 1) versus 7V(0,2); 4) N{0,1) 
versus 7V(1.5, 2) for the three set-ups respectively. 

For each simulated dataset, we fitted a series of models with the three num- 
bers of components g = 1, 2, 3 and various values of penalization parameter(s). 
For comparison, we considered both the equal covariance and unequal covari- 
ance mixture models (8); for the former, we considered the unpenalized method 
(Al = 0) corresponding to no variable selection and penalized method using BIC 
to select A; similarly, for the latter we considered five cases corresponding to fix- 
ing or selecting one or two of (Ai, A2): no variable selection with (Ai, A2) = (0, 0), 
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Table 1 

Simulation case I: frequencies of the selected numbers (g) of clusters, and mean numbers of 
predicted noise variables among the true informative (z\) and noise variables (zi). Here N\ 
and N2 were the frequencies of selecting UnequalCov(\\,\2) (with variance regularization 
scheme one) and EqualCov(Xi) by BIC, respectively. UnequalCov(\\ ,\2) (logvar) used 
variance regularization scheme two. For set-up 1, the truth was g = 1, 21 = 21 and 
22 = 279; for others, g = 2, zi = and 22 = 279 
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52 21.0 279.0 


51 21.0 279.0 
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100 21.0 279.0 
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38 2.5 277.4 


40 2.1 275.0 








34 





(adapt) 
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10 3.5 277.6 


9 3.4 275.4 








3 






penalizing only mean parameters with (Ai,A2) = (Ai,0), penalizing only vari- 
ance parameters with (Ai, A2) = (0, A2), and our proposed two methods of pe- 
nalizing both mean and variance parameters with (Ai,A2) = (Ai,A2). We also 
compared the numbers of predicted noise variables among the true 21 informa- 
tive (zi) and 279 noise variables (22). 

The frequencies of selecting g = 1 to 3 based on 100 simulated datasets are 
shown in Table 1. First, our proposed methods performed best, in general, in 
terms of selecting both the correct number of clusters and relevant variables. 
For example, it selected fewest noise variables and most informative variables. 
Second, no variable selection (i.e. no penalization) led to incorrectly selecting 
g — I for the three non-null set-ups. Third, penalizing only the mean parameters 
could not distinguish the two clusters differing only in variance as in set-up 3. 
Fourth, between the two regularization schemes for the variance parameters, 
based on both cluster detection and sample assignment (Table 2), the two gave 
comparable results, though scheme two with log- variance performed slightly 
better. 

The results for adaptive penalty for set-up 3 are detailed in row 3 (adapt) 
of Table 1, which are similar to that of using the Li-norm penalty in terms of 
both variable and cluster number selection. Since the performance of adaptive 
penalty might heavily depend on the choice of weights (or initial estimates), we 
expect improved performance if better weights can be used. 
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Table 2 

Sample assignments for g = 2 and (adjusted) Rand index (RI/aRI) values for the two 
regularization schemes for the variance parameters for simulation set-ups 2-4- #Corr 
represents the average number of the samples from a true cluster correctly assigned to an 

estimated cluster 
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g = 2 
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0.87 
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0.36 
0.49 
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Li (var-1) 
Li (logvar) 


78.4 
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1.00 


0.98 
0.99 


0.97 
0.99 


0.93 
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0.99 
1.00 
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Table 3 

Simulation case II. The mean numbers of the predicted noise variables as in each of the first 
three groups of true informative and the other group of noise variables were given by zi-za; 
the truth was that g = 2, zi = Z2 = zg = and Z4 = 300 — 3Ki 
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99 

1 
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0.2 
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0.9 
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0.1 
0.0 



266.3 
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100 





1 
81 
18 



10.0 
0.01 
0.0 



10.0 10.0 270.0 

9.0 0.8 266.6 

9.1 0.7 267.6 



3.1.2. Case II 

We considered a more practical scenario that combined clusters' differences in 
means or variances or both for informative variables. As before, for each dataset, 
n = 100 observations were divided into two clusters with 80 and 20 observations 
respectively; among the K = 300 variables, only ZKi were informative while 
the remaining K — 3Ki were noises; The first, second and third Ki informative 
variables were distributed as i) iV(0, 1) for cluster 1 versus iV(1.5, 1) for cluster 
2, ii) A^(0, 1) versus N{0, 2), iii) N{0, 1) versus A^(1.5, 2), respectively; any noise 
variable was distributed iV(0, 1). We considered Ki = 5, 7, and 10. 

The results are shown in Table 3. Again it is clear that our proposed method 
worked best: it most frequently selected the correct number of clusters {g = 
2), and used most informative variables while being able to weed out most 
noise variables. As expected, using noise variables, as in non-penalized methods 
without variable selection, tended to mask out the existence of the two clusters. 
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Table 4 

Simulation case II for grouped variables 
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3.2. Grouping variables 

Wc grouped variables for each simulated data under case II. We used correct 
groupings: the informative variables were grouped together, and so were the 
noise variables; the group sizes were 5, 7 and 5 for Ki = 5,7 and 10 respectively. 
Table 4 displays the results for grouped variables. Compared to Table 1, it is 
clear that grouping variables improved the performance over non-grouped one 
in terms of more frequently selecting the correct number g = 2 oi the clusters 
as well as better selecting relevant variables. Furthermore, grouping both means 
and variances performed better than grouping means alone. 

4. Example 
4.1. Data 

A leukemia gene expression dataset (Golub et al. 1999 [1.']) was used to demon- 
strate the utility of our proposed method and to compare with other meth- 
ods. The (training) data contained 38 patients, among which 11 were AML 
(acute myeloid leukemia) while the remaining were ALL (acute lymphoblas- 
tic leukemia) samples; ALL samples consisted of two subtypes: 8 T-cell and 
19 B-cell samples. For each sample, the expression levels of 7129 genes were 
measured by an Affymetrix microarray. Following Dudoit et al. (2002) [7], we 
pre-processed the data in the following steps: 1) truncation: any expression level 
Xjk was truncated below at 1 if Xjk < 1, and above at 16,000 if Xjk > 16,000; 
2) filtering: any gene was excluded if its max/min < 5 and max — min < 500, 
where max and min were the maximum and minimum expression levels of the 
gene across all the samples. Finally, as a preliminary gene screening, we selected 
the top 2000 genes with the largest sample variances across the 38 samples. 



B, Xie et al. /Penalized model-based clustering 



184 



Table 5 

Clustering results for Golub's data with the number of components (g) selected by BIC 



Methods 


UnequalCov{Ai , A2) 


EqualCov(Ai ) 


(0,0) (Ai,A2) 


(0) (Al) 


RI/aRI 


0.73/0.46 0.85/0.65 


0.70/0.37 0.70/0.25 


BIC 


71225 52198 


75411 63756 


Clusters 


1 2 12 3 4 


12 3 123456789 10 11 


Samples(#) 
ALL-T(8) 
ALL-B(19) 
AML(ll) 


8 8 
17 2 11 1 7 
11 11 


80 700000100 
08 11 045240011 1 1 
00 11 000074000 



4-2. No grouping 

4-2.1. Model-based clustering methods 

Table 5 displays the clustering results: the two penalized methods selected 4 
and 11 clusters, respectively, while the two standard methods chose 2 and 3 
clusters, respectively. For the new penalized method, we show the results for 
scheme one of regularizing the variance parameters; the other scheme and the 
adaptive penalization both gave similar results, and hence are skipped. In terms 
of discriminating between the luekemia subtypes, obviously the new penalized 
method performed best: only one ALL B-cell sample was mixed into the AML 
group, while others formed homogeneous groups. In contrast, with a large num- 
ber of clusters, the Li method with an equal covariance still misclassified 4 ALL 
B-cell samples as AML. The two standard methods perhaps under-selected the 
number of clusters, resulting in 11 and 10 mis-classified samples, respectively. 
Unsurprisingly, based on the Rand index (Rand 1971 [42]) (or adjusted Rand 
index (Hubert and Arable 1985 [22])), the new method was a winner with the 
largest value at 0.85 (0.65), compared to 0.73 (0.46), 0.70 (0.37) and 0.70 (0.25) 
of the other three methods. In addition, judged by BIC, the new penalized 
method also outperformed the other methods with the smallest BIC value of 
52198. Finally, the new penalized method used only 1728 genes, while penalizing 
only means with a common covariance matrix used 1821 genes; the other two 
standard methods used all 2000 genes. 

Figure 1 displays the estimated means and variances of the genes in different 
clusters. Panels a)-c) clearly show that the genes may have different variance 
estimates across the clusters, though many of them were shrunk to be exactly to 
be one, as expected. Note that, due to the standardization of the data yielding 
an overall sample variance one for each gene, we do not observe any gene with 
the variance estimates more than one in two or more clusters. Panels d)-f) 
confirmed that there appears a monotonic relationship between the mean and 
variance, as well-known in the microarray literature (e.g. Huang and Pan 2002 
[19]); the Pearson correlation coefficients were estimated to be 0.64, 0.69, 0.65 
and 0.63 for the four clusters respectively. Hence, it lends an indirect support 
for the use of cluster-specific covariance matrices: if it is accepted that the genes 
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Fig 1. Scatter plots of the estimated means and variances by the new penalized method. Panels 
a)-c) are scatter plots of the estimated variances in cluster 1 versus those in cluster 2, 3 and 
4, respectively; panels d)-g) are the scatter plots of the estimated means versus estimated 
variances for the four clusters respectively. 



have varying mean parameters across clusters, then their variance parameters 
are expected to change too. 

Next we examine a few genes in more details. CST3 (cystatin c, M23197) and 
ZYX (zyxin, X95735) were in the top 50 genes ranked by Golub et al. (1999) 
[15], and two of the 17 genes selected by Antonov et al. (2004) [2] to discriminate 
between the AML/ALL subtypes. In addition, the two genes, together with MAL 
(X76223), were also identified among the top 20 genes used in the classifier by 
Liao et al. (2007) [2s] to predict leukemia subtypes. Bardi et al. (2004) [4] 
used CST3 to assess glomerular function among children with leukemia and 
solid tumors and found that CST3 was a suitable marker. Koo et al. (2006) 
[2G] reviewed the literature showing the relevance of MAL to T-cell ALL and 
concluded that it might play an important role in T-cell activations. Baker et 
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Fig 2. Observed expression levels of two pairs of genes and the corresponding clusters found 
by the two penalized methods. 



al. (2006) [3] and Wang et al. (2005) [49] identified ZYX as the most significant 
gene for classifying AML/ ALL subtypes. Tycko et al. (1991) [48] found that 
LCK (M26692) was related to activated T cells and thus it might contribute to 
the formation of human cancer. Wright et al. (1994) [-"jl] studied the mutation 
of LCK and concluded that it probably played a role in some human T-cell 
leukemia. 

In Figure 2, panels c)-d) are the zoom-in versions of the left bottom of a)-b), 
the plots of gene pair CST3 and MAL for all samples for the two penalized 
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Fig 3. Penalized mean and variance estimates for cluster one containing the 11 ALL B-cell 
samples by the new penalized method. 

methods respectively, while e)~/) are for gene pair LCK and ZYX with all 
samples. Given two genes, their observed expression levels, along with the 95% 
confidence region of the center for each cluster, were plotted. The penalized 
method with an equal covariance found 11 clusters, among which 5 clusters had 
only one sample, and the remaining 6 clusters had more than 2 samples; for 
clarity, we only plotted the confidence regions of the centers of the six largest 
clusters. Panels a) and c) clearly show evidence of varying variances, and thus 
cluster-specific covariance matrices: for example, MAL was highly expressed 
with a large dispersion for ALL-T samples, so was CST3 for AML samples, 
in contrast to the low expression of both genes for ALL-B samples, suggesting 
varying cluster sizes. It also clearly illustrates why there were so many clusters 
if an equal covariance model was used: the large number of the equally-sized 
clusters were used to approximate the three or four size-varying true clusters. 
Panel c) also suggests an explanation to the use of two clusters for ALL-B 
samples by the new penalized method: the expression levels of MAL and CST3 
were positively correlated, giving a cluster not parallel with either coordinate; 
on the other hand, use of a diagonal covariance matrix in the penalized method 
implied a cluster parallel to one of the two coordinates. Hence, two coordinate- 
parallel clusters were needed to approximate the non-coordinatc-parallel true 
cluster; a non-diagonal covariance matrix might give a more parsimonious model 
(i.e. with fewer clusters). 

Finally, we show the effects of shrinkage and thresholding for the parameter 
estimates by the new penalized method. Figure 3 depicts the penalized mean 
estimates (panel a) and variance estimates (panel b) versus the sample means 
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Table 6 

Clustering results for Golub's data by PAM and K-means methods. The number of clusters 



selected by the silhouette width are masked by * 


Methods 


PAM K-means 


RI/aRI 


0.46/0 0.65/0.24 0.64/0.22 0.67/0.27 0.80/0.53 0.75/0.42 


Clusters 


1 2* 12 3 1234 12 3 1 234 12345 6* 


Samples (#) 


7 1 17 1700 53 0080 000080 
11 8 83 8 8371 10 7 2 10 207 19080 1 
11 11 11 090 01 10 10 01 04700 


ALL-T (8) 
ALL-B (19) 
AML (11) 



and variances respectively for cluster one. It is confirmed that the penalized 
mean estimates were shrunk towards 0, some of which were exactly 0, while the 
penalized variance estimates were shrunk towards 1, and can be exactly 1. 

4-2.2. Other clustering methods 

Previous studies (e.g. Thalamuthu ct al. 2006 [44]) have established model-based 
clustering as one of the best performers for gene expression data. Although it 
is not our main goal here, as a comparison in passing, we show the results of 
other three widely used methods as applied to the same data with the top 2000 
genes: hierarchical clustering, partitioning around mcdoids (PAM) and K-means 
clustering. 

It is challenging to determine the number of clusters for PAM and K-means. 
Here we consider two proposals. First, we used the silhouette width (Kaufman 
and Rousseeuw 1990 [24]) to select the number of clusters. Two and six clusters 
were chosen for PAM and K-means respectively; neither gave a good separation 
among the three leukemia subtypes (Table 6). Second, to sidestep the issue, we 
applied the two methods with three and four clusters because those numbers 
seemed to work best for model-based clustering. Nevertheless, PAM worked 
poorly, while K-means with 4 clusters gave a reasonable result, albeit not as 
good as that of the new penalized model-based clustering, as judged by an 
eye-ball examination or the (adjusted) Rand index. 

Figure 4 gives the results of hierarchical clustering with all three ways of 
defining a cluster-to-cluster distance: average linkage, single linkage and com- 
plete linkage. The average linkage clustering gave the best separation among the 
three leukemia subtypes: all 8 T-cell samples, except sample 7, were grouped 
together; there were 10 B-cell samples in a group; all other ALL samples seemed 
to appear in the AML group. On the other hand, the average linkage clustering 
identified about six outlying samples, which were samples 7, 18, 19, 21, 22 and 
27 respectively; this finding was consistent with that of the penalized model- 
based clustering with an equal covariance matrix, which detected the same five 
outliers except sample 19. 
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Average linkage clustering 




Fig 4. Agglomerative hierarchical clustering results for the 38 leukemia samples: the first 8 
samples were T-cell ALL; samples 9-27 were B-cell ALL; the remaining ones were AML. 

4-2.3. Other comparisons 

Although mainly studied in the context of supervised learning, with several 
existing studies, Golub's data may serve as a test bed to compare various clus- 
tering methods. Golub ct al. (1999) [15] applied self-organizing maps (SOM): 
first, with two clusters, SOM mis-classified one AML and 3 ALL samples; sec- 
ond, with four clusters, similar to the result of our new penalized method, AML 
and ALL-T each formed a cluster while ALL-B formed two clusters, in which 
one ALL-B and one AML samples were mis-classified. They did not discuss why 
or how g = 2 or g = 4 clusters were chosen. In Bayesian model-based clustering 
by Liu et al. (2003) [30], two clusters were chosen with one AML and one ALL 
mis-assigned; they did not discuss classification with ALL subtypes. 

In a very recent study by Kim et al. (2006) [25], with two clustering algo- 
rithms and two choices of a prior parameter, they presented four sets of clus- 
tering results. In general, ALL samples formed one cluster while AML samples 
formed 5 to 6 clusters, giving 0-3 mis-assigned ALL samples; although not dis- 
cuss explicitly, because either all or almost all the ALL samples fell into one 
cluster, their method obviously could not distinguish the two subtypes of ALL. 
Furthermore, their result on the multiple clusters of AML was in contrast to 
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ours and Golub's on the homogeneity of AML samples. Because Kim et ah used 
different data pre-processing with 3571 genes as input to their method, for a fair 
comparison, we apphed the same dataset to our new penahzed method, yielding 
five clusters: only one ALL-B was mis-assigned to a cluster containing 10 AML 
samples, one cluster was consisted of one ALL-B and AML samples, while the 
other three clusters contained 8 ALL-T, 10 ALL-B and 7 ALL-B respectively. 
For this dataset, our method seemed to work better. 

It was somewhat surprising that there were about 1800 genes remaining for 
the penalized methods, though previous studies showed that there were a large 
number of the genes differentially expressed between ALL and AML; in par- 
ticular, Golub et al. (1999) [15] stated that "roughly 1100 genes were more 
highly correlated with the AML- ALL class distinction than would be expected 
by chance"; see also Pan and Shen (2007) [3!)] and references therein. For a 
simple evaluation, we apphed the elastic net (Zou and Hastie 2005 [58]) to the 
same data with top 2000 genes; the elastic net is a state-of-the-art supervised 
learning method specifically designed for variable selection for high-dimensional 
data and is implemented in R package elasticnet. Five-fold cross-validation 
was used for tuning parameter selection. As usual, we decomposed the three- 
class problem into two binary classifiers, ALL-T vs AML, and ALL-B vs ALL, 
respectively. The two classifiers eliminated 395 and 870 noise genes, respectively, 
with a common set of 227 genes. Hence the elastic net used 1773 genes, a number 
comparable to those selected by the penalized clustering methods. 

4-3. Grouping genes 

The 2000 genes were grouped according to the Kyoto Encyclopedia of Genes and 
Genomes (KEGG) Pathway database (Kanehisa and Goto 2000 [23]). About 43 
percent of the 2000 genes belonged to at least one of 126 KEGG pathways. If 
a gene was in more than one pathway, it was randomly assigned to one of the 
pathways to which it belonged. Each genes un-annotated in any pathway was 
treated as an individual group with size 1. Among the 126 KEGG pathways, 
the largest pathway size was 44, the smallest one was 1 and the median size was 
4; about three quarters of the pathways had sizes less than 8. 

The clustering results with the grouped mean and variance penalization were 
exactly the same as that of UnequalCov and kept 1795 genes. Among the 205 
identified noise genes, 23 genes were from 17 KEGG pathways: all contained 
only one genes except only three pathways, each containing 2, 3 and 4 genes 
respectively. 

To further evaluate the above gene selection results, we searched a Leukemia 
Gene Database containing about 70 genes that were previously identified in the 
literature as leukemia-related (www .bioinf ormatics . org/legend/leukjlb .htm) 
Among these informative genes, 58 were related to 21 leukemia subtypes, among 
which only 47 and 36 genes appeared in the whole Golub's data and the 3337 
genes after preprocessing respectively. Among the top 2000 genes being used 
for clustering, there were only 30 genes in the Leukemia Gene Database, most 
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Table 7 

The genes in the Leukemia Gene Database that were retained in or removed from the final 
model for the grouped method. The six genes in italic font were annotated in KEGG 

pathways 





Leukemia Subtype 


Gene Name 




Acute Lymphoblastic Leukemia 


MYC, ZNFNIAI 




Acute Myelogenous Leukemia 


IRFl, GMPS 




Acute Myeloid Leukemia 


CBFB, NUP214, HOXA9, FUS, 
RUNXl 




Acute Promyelocytic Leukemia 


PML 




Acute Undifferentiated Leukemia 


SET 




B-cell Chronic Lymphocytic Leukemia 


BCL3, BTGl 




Myeloid Leukemia 


CLC 


Retained 


pre B-cell Leukemia 


PBXl, PBX3 




T-cell Leukemia 


TCL6 




T-cell Acute Lymphoblastic Leukemia 


NOTCH3, LYLl, LM02, TAL2 




Cutaneous T-cell Leukemia 


NFKB2 




Human Monocytic Leukemia 


ETSl 




Mast cell Leukemia 


KIT 




Mixed Linkage Leukemia 


MLL3 




Acute Myeloid Leukemia 


LCPl 




Acute Myelogenous Leukemia 


RGS2 


Removed 


Murine Myeloid Leukemia 


EV12B 




pre B-ccU Leukemia 


PBX2 




T-cell Leukemia 


TRA@ 



of which were not in any KEGG pathways; only 7 genes appeared in KEGG 
pathways: GMPS, ETSl, N0TCH3, MLL3, MYC, NFKB2 and KIT. Table 7 
lists the genes that were selected in and deleted from the final model. Among 
the 205 noise genes selected by our group penalized method, five of them were 
annotated in the Leukemia Gene Database, among which one was related to 
AML. 

Because most of the known leukemia genes were not in any KEGG pathways, 
reflecting perhaps the current lack of prior knowledge, the grouped method could 
not be established as a clear winner over the none-grouped method in terms of 
leukemia gene selection in the above example. To confirm the potential gain with 
a better use of prior knowledge, we did two additional experiments. First, in ad- 
dition to the KEGG pathways, we grouped all the 19 leukemia genes not in any 
KEGG pathways into a separate group: the samples were clustered as before; 
among the 200 genes removed from the final model, there were only two leukemia 
gene, ETSl, which was related to human monocytic leukemia, neither AML nor 
ALL, and N0TCH3, related to T-cell ALL. Second, in addition to the KEGG 
pathways, we grouped the AML ("acute myeloid leukemia" in Table 7) or ALL 
("acute lymphoblastic leukemia" and "T-cell acute lymphoblastic leukemia") 
genes into two corresponding groups while treating the other leukemia genes 
individually: again the samples were clustered as before; among the 216 genes 
removed from the final model, ETSl, RGS2, EVI2B, PBX2, TRA@ were the 
four leukemia genes and there was no single gene related to AML or ALL. These 
two experiments demonstrated the effectiveness of grouping genes based on bi- 
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ological knowledge, and that, as expected, the quahty of the prior knowledge 
would influence performance. Nevertheless, our work here is just a first step, 
and more research is necessary to establish the practical use of grouping genes 
for microarray data. 

5. Discussion 

We have proposed a new penalized likelihood method for variable selection in 
model-based clustering, permitting cluster-dependent diagonal covariance ma- 
trices. A major novelty is the development of a new Li penalty involving both 
mean and variance parameters. The penalized mixture model can be fitted 
easily using an EM algorithm. Our numerical studies demonstrate the utility 
of the proposed method and its superior performance over other methods. In 
particular, it is confirmed that for high-dimensional data such as arising from 
microarray experiments, variable selection is necessary: without variable selec- 
tion, the presence of a large number of noise variables can mask the clustering 
structure underlying the data. Furthermore, we have also studied penalties for 
grouped variables to incorporate prior knowledge into clustering analysis, which, 
as expected, improves performance if the prior knowledge being used is indeed 
informative. 

The present approach involves only diagonal covariance matrices. It is argued 
that for "high dimension but small sample size" settings as arising in genomic 
studies, the working independence assumption is effective, as suggested by Fra- 
ley and Raftery (2006) [12], as well as demonstrated by the popular use of a 
diagonal covariance matrix in the naive Bayes and other discriminant analyses 
due to its good performance (Bickel and Levina 2004 [-5]; Dudoit et al. 2002 
[7]; Tibshirani et al. 2003 [47]). Nevertheless, it is worthwhile to generalize the 
proposed approach to other non-diagonal covariance matrices, possibly built on 
the novel idea of shrinking variance components as proposed here. However, this 
task is much more challenging; a main difficulty is how to guarantee a shrunk 
covariance matrix to be positive definite, as evidenced by the challenge in a sim- 
pler context of penalized estimation of a single covariance matrix (Huang et al. 
2006 [21]; Yuan and Lin 2007 [o'>]). An alternative approach is to have a model 
intermediate between the independent and unrestricted models. For example, 
in a mixture of factor analyzers (McLachlan et al. 2003 [35]), local dimension 
reduction within each component is realized through some latent factors, which 
are also used to explain the correlations among the variables. Nevertheless, be- 
cause the latent factors are assumed to be shared by all the variables while in 
fact they may only be related to a small subset of informative variables, vari- 
able selection may still be necessary; however, how to do so is an open question. 
Finally, although our proposed penalty for grouped variables provides a general 
framework to consider a group of genes, e.g. in a relevant biological pathway 
or functional group, for their either "all in" or "all out" property in clustering, 
there remain some practical questions, such as how to choose pathways and 
how to handle genes in multiple pathways. These interesting topics remain to 
be studied. 
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Appendix A: Composite Absolute Penalties (CAP) 

We generalize our proposed group penalization, including the two regularization 
schemes on variance parameters, to the Composite Absolute Penalties (CAP) of 
Zhao et al. (2006) ['tG], which covers the group penalty of Yuan and Lin (2006) 
[54] as a special case. 

For grouping mean parameters, the following penalty function is used for the 
mean parameters: 



g M 



1/70 



(26) 



\2— 1 m— 1 / 

where + l/7m = 1; 7m > 1 E^nd ||v||-y,^ is the L-y^ norm of vector v. 

Accordingly, we adopt 



g M 



,70/7;. |j 2 



IF" 



or 



g M 



1/70 



1/70 



(27) 



(28) 

\ i—1 m—1 / 

as a penalty for grouped variance parameters. To achieve sparcity, as usual, we 
use 7o = 1. 

The E-step of the EM yields Qp with the same form as (2). Next we derive 
the updating formulas for the mean and variance parameters in the M-step. 

A.l. Grouping mean parameters 

If the CAP penalty function (26) for grouped means is used, we can derive the 
following Theorem: 

Theorem 6. The sufficient and necessary conditions for fi = (/i™), i = 1, 2, . . . , 17 
and m = 1, 2, ... , M , to be a unique maximizer of Qp are 



and 



if and only if fi"^ 7^ 



yielding 



/7™ sign{p:^)\n. 



m \ I . .m |7„i — 1 



i=i 



< Aifc: 



l/7,'„ 



if and only if fi^ 



sign 1 



Aifc. 



1/7™ 



(29) 
(30) 

(31) 
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the usual MLE. 



IS 



Proof. Consider two cases: 

i) ^ 0. First, by definition and using the Holder's inequality, we can prove 
that the L-y^ norm is convex, thus the penalty function for grouped means is 
convex in /i™. Second, treating Qp as the Lagrange multiplier for a constrained 
optimization problem with the penalty as the inequality constraint, and con- 
sidering that both minus the objective function and the penalty function are 
convex, by the Karush-Kuhn- Tucker (KKT) condition, we have the following 
sufficient and necessary condition 

from which we can easily get (29). 

ii) = 0. By definition, we have 

Qp{0, ■) > Qp(A/i™, .) for any A^™ close to 

i 
j 

^ Y.^V^T'^^rr!^I^Tm^^T\U^ - ^ T,, ( AmD'^-' A/l™/ (2 1 1 A/ir 1 ) 
J j 

Notice (A/ir)/V^™ A//™/(2||Am™||^„) ^ 0+ as A/i™ ^ 0. By the Holder's 

inequality, we have |e, n^x^/V-^ miJiTWi^ | < || n,x^'y^'r^\\lL^ and 
the " = " can be attained. Thus the above inequality is equivalent to (30). □ 

It is clear that, if Ai is large enough, /i™ will be exactly due to thresholding. 
Since depends on /i™, we use (31) iteratively to update /i™. 



A. 2. Grouping variance parameters 

A. 2.1. Scheme 1 



If the penalty function (27) for grouped variances is used, we have the following 
theorem: 
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Theorem 7. The sufficient and necessary conditions for af ^ — 1, i — 1,2, . . . , g 
and m = 1,2,..., M , to be a local maximizer of Qp are 



n 
3 = 1 



il-i(a:™-u™)2 
2 2^ ^ ' 



1'^ 



'/E 



i=i 
otherwise. 



-l-ix'f^-^j^f 



<0; 



(32) 



The necessary condition for cr~^ ^ 1 to be a local maximizer of Qp is 
1 1 



E' 



(xf - 



2af 



Km - 1|I7: 



(33) 

Proof. If (t|„j = 1 is a local maximum, by definition, wc have the following 
sufficient and necessary condition 

Qp{l, .) > Qp{l + Aal„,^, .) for any Acrf „ near 



E- 

i 

E- 



1 



(xf Mr)'(^r - 



--log|diag(l + Aa2„)| 



Thus, 



i(x7 - Mr)/diag(l + Aaf,„)-i(x™ - 



A2fci^''"||Aa2 



E- 



--log|diag(l + Aa2„)| 



+ 2(^r - Mr)'diag(Aa?,„/(l + Aa^_„))(x™ - /i™) 



Using Taylor's expansion, we have 



E- 



1 , ixf-^i'n' 



3 = 1 



-1 

2 



+ 0{c'{AalJ^)<X,kK'-\\Aa. 



for some constant vector c. After dividing both sides by ||A(T^„||-y,^ and using 
the same argument as before, we obtain (32) as the sufficient and necessary 
condition for „ = 1 to be a local maximizer of Qp. 



B. Xie et al. /Penalized model-based clustering 



196 



Setting the first-order derivative of Qp equal to 0, we have (33), the necessary 
condition for af^^ 7^ 1 to be a local maxiniizer of Qp. □ 

It is clear that we have erf — 1 when, for example, A2 is large enough. It is 
also easy to verify that the above conditions reduce to the same ones for afj^ = 1 
for non-grouped variables when k,n = 1 and reduce to (24) and (25) for grouped 
variables when 7,„ = 7,'„ = 2. 



A. 2. 2. Scheme 2 



If we use the CAP penalty function (28) for grouped variances, then the following 
theorem can be obtained by a similar argument as before: 

Theorem 8. The sufficient and necessary conditions for (t|,j^ = 1 , i = 1, 2, . . . , 3 
and m ^ 1,2,..., M , to be a local maximizer of Qp are 

1 1 



n 



-i--(x™-Mr)' 
l^-^i-T-i^rr 



1/7' 

< X2km « 



^E- 

i=i 



-i-(x™-Mrf 



1/7' 

<A2fc,n ™ otherwise. 



The necessary condition for of ^ 1 to be a local maximizer of Qp is 



<0; 



(34) 



E' 



1 



Hrn 



A2 fcm sign{log al,^ ) \ log cr^ ^„ | ^ 



(35) 



Appendix B: Proofs 

B.l. Derivation of Theorem 1 

Since Qp is differentiable with respect to when fiik 7^ 0, while non-differen- 
tiable at /i^fc = 0, we consider the following two cases: 

i) If fiik 7^ is a maximum, given that Qp is concave and differentiable, then 
the sufficient and necessary condition for /Xife to be the global maximum of Qp 
is 

n 

dQp/dfiik = ^ Tij{xjk - fJ.ik)/crfk - Aisign(/Xifc) = 0, 

from which we have (10). 

ii) If fiik = is a maximum, we compare Qp(0, .) with Qp{Afj,ik, .), the values 
of Qp at /iife = and ^ik = 'AfJ-ik respectively (while other components of 
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are fixed at its maximum). By definition, we fiave 

Qp(0, .) > Qp{Afiik, .) for any Afi.k near 

j 3 



'ijXjk 



/f^ik < ^1 as A/i^fc ^ 0. 



It is obvious that from (10) we fiave sign(^ife) = sign(X]"=i njXjk/ Z]?=i )> 



tfius 



T,]Linj Xjk f _ Aia4sign(/^\ _ E"=i ^u^jfc / _ Au, 

I — »M I -L ^ — »n I I — »r) I J- I * — ^ rj 



wliicfi, in combination with (11), yields (12). 



B.2. Derivation of Theorem 2 

Since Qp is differentiable with respect to af^. when afi, ^ 1, we know a local 



maximum ct^ must satisfy the following conditions 
^ Qp{Q-M'-^)\.^,. =0 ifaf.^l; 



(36) 



p(l, •) > '3p(1 + A(T,^j., .) if o-fj, = land for any Acrfj, near 0. 



where . \w Qp{l, .) represents all parameters in Qp except cr,^ 



ik ■ 



Notice that Qp = Ci+E, [-5 log'^ffc + ^2 - ^(a:,fe - M«fe)V^?fc] -^2^?^- 
1 1 + C's > where Ci , C2 and C3 arc constants with respect to crfj, . Therefore the 
first equation of (36) becomes 



1 



A2sign(a2, - 1) = 0, if <7,f, ^ 1 



from which we can easily get (13). 

Starting from the second equation of (36), we have 



LHS 
RHS 



ci + E- 
^i + E- 



-Uog{\) + C2-\{x,k-^^^k?/l 



A2I1-1I + C3, 



-i log(l + Aa^,,) + C2- \{x,k - M,fc)V(l + Aa.2,,) 



-A2|Aa2,|+C3, 
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and thus 

\ E + ^'^^fe) - (^J'^ - M^fc)'(l/(l + Aaf,) - 1)] < A2|Aaf,|. 

Using Taylor's expansion, we have 

j=i V / 
leading to 

E-.. (-^ + i^^i^^) sign(Aa|,) + 0{\Aal\) < X,. 
letting Aafi^ —^ 0, we obtain (14). 



B.3. Derivation of a^i^ in section 2.3.1 

Note that from (13) we have {—dQp/da1^)(j^j^ — aikcrjj^ +^i'^ik ~ '^ik = 0. Define 
f{x) = aikX^ + biX - c^k = 0. 

First, we consider the case with \bi — Cik\ < A2. 

i) When afj, > 1, f{af^) = OifcO-^ + =X2^fk > 0, and f{x) = A2sign(a; - 
l)x^ + biX - Cik = A2a;^ + hiX - dk > hx - dk > haff. - c^k = if x > af^^ > 1. 
On the other hand, \im^_,i+ f{x) = X2 + hi ~ Cik > 0, since |6,; — c,jt| < A2; and 
f{x) = — A2X^ + biX — Cik < —X2X^ + bi — Cik < if X < I. Thus, based on the 
signs of /(a;), Qp has a unique local maximum at a"^/. = 1. 

ii) When af^ < 1, we have f{<yfi,) = -X2^fk < 0, and f{x) = -X2X^ + 6,x - 
Cik < biX - Cik < haf^. - Cjfc = if X < ct^^; lim^^i- f{x) = -A2 + &i - Cik < 0; 
and f{x) = X2X^ + biX — Ci > A2 + fei — > if a; > 1. However, for afj, < a; < 1, 
f{x) = — A2X^ + biX — dk is a continuous and quadratic hmction, which may 
have two roots 

b, ± y^b'i - 4X2C,k 

•"^^^ ^ ^2 ■ 

If bi — Cik < A2, then Van^^i- f{x) < 0, implying that, according to the signs 
of f{x) around x = I, x = 1 is a local maximum of Qp, and the smaller of 
a;i.2 is also a local maximum (if it exists); on the other hand, if 5, — Cik = A2, 
then limj,_+]^- f{x) = 0, implying that either a; = 1, the smaller root of xi 2 if 
d'fk ^ or X ~ Cik/X2, the larger root of a;i,2 if i^fk < 1/2, is the unique 
maximum. 

Second, we claim that, if — Cifc| > A2, there exists a unique local maximizer 
^ik 7^ 1 for Qp and it must lie between 1 and af^, = Cik/bi, the usual MLE 
without penalty. This can be shown in the following way. 

i) When af^. > 1, fi^i^.) = a.k^fk + ^X2af^. > 0, and f{x) = A2sign(a; - 
l)x-^ + biX - Cik = A2X^ + biX - Cik > b^x - c^k > bt^f^ - Qfc = if a; > af,. > 1. 



B. Xie et al. /Penalized model-based clustering 



199 



On the other hand, hm2._fi+ f{x) = A2 + bi — Cik < 0, since bi — Cik < — A2; and 
f{x) = —X2X^ + biX — dk < —Xix^ + hi — Cik < if X < 1. Thus f{x) has a 
unique root aff^ £ ih^lk) 



a) When afi^ < 1, similarly we have f{a'fk) < 0, and f{x) ~ —X2X^ + biX 



ik ^ v-L' "ikJ 
1, sim 

Cik < biX - Cik < bicrfk ~ ^ik = if x < ct^j.; lim^^i- f{x) = -A2 + bi - dk > 0; 
and f{x) = A22;^ + biX — Ci > bi — q > if x > 1. Thus f{x) has a unique root 

Based on the signs of f{x) around x = af/^, it is easy to see that o-^j, is indeed 
a local maximizer. 

Third, (13) can be expressed as 



A20-,ft. + b.,af^. - c.,k = 0, if b^ - c^k > A2 

(37) 



A20"! + bicrl - dk = 0, if bi - Cik < -A2. 



From the first equation of (37), we get af). ~ (bt± \/b1 — ^-\iCik)j /2A2. 

Since b, + \/&f~"4A^ > + \/(QfeTA^F^"4A^ > {c^k + A2) + \c,k - A2I 
> 2A2 and that bi — Cik > X2 implies aff. = Cik/bi < 1 while (t|j, must be be- 
tween af,. and 1, we only have one solution afk = (^bi — \/bf — 4A2Cifc^ /2A2 

— ^ik/ ^ \J\ ~ ^1^^) ■ Fi'o™ the second equation, similarly we get af^. = 
^Ikl (l + ^\ + ■ Combining the two cases, we obtain (15). 

Note that the expression inside the square root of (15) is non-negative. To 
prove it, we only need to show that for b\ + 4sign(cifc — bi)\iCik. Consider two 
cases: 

i) When dk - 5i > A2 > 0, 

hi + 4sign(c,fe - 5,)A2C,fc = 5^ + 4A2C,fc > b\ ^ 4A2(6, + A2) = (6, + 2A2)' > 0. 

ii) When c^k — bi < — A2 < 0, 

bl + 4sign(c,fe - 5,)A2C,fc ^ b^ ^ AX^c.k > bf - 4A2(6, - A2) = {b, - 2X2f > 0. 



B.4- Derivation of Theorem 3 

We prove the necessary conditions below, while the sufficiency is proved as a 
side-product in Appendix B.5. 

Since Qp is diffcrcntiable with respect to a^j^ when aff, 7^ 1, we know a local 
maximum afk must satisfy the following conditions 



^Qp(e;eM) 



^<^^k 



if^lfc ^1; 

(38) 



,Qp(l, •) > Qp(l + Acrfj., .) if af^ = 1 and for any Acrfj. near 0. 
where . in (5p(l, .) represents all parameters in Qp except a^^. 
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Notice that Qp = Ci + J2j '^tj [-^^ogaff. + C2 - ^{xjk - ^-kk)^ /crf^] 
X2 1 log afi^ I + C3 , where Ci , C2 and C3 arc constants with respect to cr|j, . 



Therefore the first equation of (38) becomes 

1 , {xjk~f^tk)^\ A2sign(logo-,2^) 



2-^ 



2^fk 



0, ifa4^1 



from which we can easily get (17). 

Starting from the second equation of (38), we have 



LHS ^ 
RHS = 

and thus 



ll0g{l) + C2-\{x,k-f^:kf/l 



A2|l0gl|+C3, 



-i log(l + Aal) + C2- i(x,fe - M^fc)V(l + Aaffc) 



A2|log(l + Aa2,)|+C3, 



i ^ [- log(l + Aal) - {x,k - /i.fc)'(l/(l + Aa?,) - 1)] < A2I log(l + Aa, 



Using Taylor's expansion, we have 

^^'^ r2 + — 2 — 

i=i ^ 
leading to 



Ad?, + OiiAalf) < A2I log(l + Aal)\, 



1 (l-jA; - fiikY 



signiAal) + Oi\Aal\)<X2 



letting ActI^, 0, we obtain (18). 



log(l + AafJ 



A-i. 



B.5. Derivation of a^j^ in section 2.3.2 

Let fia^) = -2a,f,(aQp/aaf,) = a|,[l + A2sign(logaf,)/6.] - a^,, where f{x) 
is defined as f{x) = x[l + A2sign(logx)/6j;] — af^. Thus (17) is equivalent to 
fi^ik) — 0- First, we consider the case with \bi — Cik\ < A2, the necessary 
condition of afj. = 1. 

i) When > 1, /(x) = x[l + A2/&.] - > 1 + A2/6. - Qfc/fe, = [(6, - 
Cife) + A2]/&i > if x > 1. On the other hand, f{x) = x[l — X2/bi] — af^ 
{x — aff^) — x\2/bi < —xX2/bi < if a; < 1 < afj,. Thus, based on the signs of 
f{x), Qp has a unique local maximum at aff^ = 1. 

ii) When aff, < 1, we have Cik/bi < 1, thus < 6^ — Cjk < A2. f{x) = 
x[l + A2/6i] — aff. = {x — afi^) + xX2/bi > if a; > 1. On the other hand, 
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f{x) = a-[i - A2/&.] - = x[h - X2]/h - < xc,k/h - a^ ^{x- < 

if < a; < 1. Thus, based on the signs of f{x), Qp has a unique local maximum 
at = 1. _ 

i) and ii) indicates that \bi ~ Cik\ < A2 is also the sufficient condition of 

Second, we claim that, if — Ci/c| > A2, there exists a unique local maximizer 
^ik 7^ 1 for Qp and it must He between 1 and al ~ Cik/bi, the usual MLE 
without penalty. This can be shown in the following way. 

i) When al > 1, we have bi < Cit, and further Cik — bi > A2. Notice f{x) = 
x[l — X2/bi] — al = {x — al) — xX^/bi <0 ifO<a:<l. Thus the possible root 
of f{x) — should be larger than 1. For a; > 1, f{x) = x[l + \2/bi\ — al is a 
linear function of x. f{al) =\2al/bi > 0, and lima;^i+ f{x) = [1 + A2/6i] — al 
= ibi-Cik + \2)/bi < 0. Thus /(x) = has a unique root al = al/{l + X2/bi) e 

ii) When al < 1, we have bi > Cik, and further bi — Cik > A2. Notice 
f{x) — x[l + X2/bi] — al = {x — al) + x\2/bi > if a; > 1. Thus the possible 
root oi f{x) = should be smaller than 1. For a; < 1, f{x) = x[l — \2/bi]~al is a 
linear function of a; . f{al) =-X2al/bi < 0, andlim^.^!- /(x) = [^-X2/bi]-al 
= {bi — Cik — X2)/bi > 0. Thus f{x) = has a unique root al = al/{l — X2/bi) e 

i^fk^)- 

Based on the signs of f{x) around x = al, it is easy to see that al is indeed 
a local maximizer. And al — al/(l + sign(cifc — bi)X2/bi) 

B.6. Derivation of Theorem 4 

Consider two cases: 

i) /i™ 7^ 0. First, by definition and using the Cauchy-Schwarz inequality, we 
can prove that the L2 norm is convex, thus the penalty function for grouped 
means is convex in /x^. Second, treating Qp as the Lagrange multiplier for a 
constrained optimization problem with the penalty as the inequality constraint, 
and considering that both minus the objective function and the penalty function 
are convex, by the Karush-Kuhn- Tucker (KKT) condition, we have the following 
sufficient and necessary condition 

^Qp/^^^T = ^ J2^^lV-,'K - - AiV^KVIlMril = 0, 

j 



from which we can easily get (21). 
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ii) = 0. By definition, we have 

Qp{0, .) > Qpi^fiT^ ■) foi- any A/i™ close to 

3 



< 



Plugging-in A/x™ = '''ij^im^T ^'^'^ letting a ^ 0, we obtain (22) from the 
above inequality. On the other hand, by the Cauchy-Schwarz inequality, we have 

n,xffV-^'Af,T/\\AtiT\\\ <\\Ej njxJ^'V-,'\\, and because is positive 
definite, we obtain the above inequality from (22). 



B.7. Derivation of Theorem 5 

If (jf „i = 1 is a local maximum, by definition, we have the following sufficient 
and necessary condition 

(3p(1, •) > (3p(1 + Aa2„, .) for any Aaf „ near 
1 



Thus, 



E- 

3 

E- 



Ci > 



■ log |diag(l + Act, 



' ) 



--(zj' - Mr)'diag(l + AcTlJ-\xf - A^r) 



E- 



--log|diag(l + Aa2„J| 



1 



2^ J 
< \2\/k,n\\Aa^ 
Using Taylor's expansion, we have 



+ - Mr)/diag(Aaf.„/(l + Aa/.,J)« - ^T) 



E- 



1 



Aa 



■E' 



,2 )2 

z,m / 



0(c'(Aa^,,j3) < A2yfc::i|Aa2, 
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for some constant vector c. After dividing both sides by ||Acr|„^|| and using 
the same argument as before, we obtain (24) as the sufBcient and necessary 
condition for crf^^ = 1 to be a local maximizer of Qp. 

B.8. Characterization of solutions to (25) 

Consider any component k' , af^, of (t|„j. Equation (25) corresponds to 

bik' ^inik' / 2 -1 \ 

where ai„i ~ ^2\/km/\Wi m ^ ^^^^ bik' and Cimk' are the fc'th components of 
bi and Cim respectively. If A2 = 0, then (t|j,, — Cimk' /bik' — fyfy^ usual MLE 
without penalization; if A2 7^ and we treat aim as a constant (i.e. by plugging- 
in a current estimate of uff., ) , the above equation becomes a cubic equation of 

^l',f{^l'\ 

f{x) = + ax^ +bx + c 

where a = -1,6 = bik' lai„,,c = -c^mk' I aim- 
Now we consider the following two cases: 

i) When d\, = Cimk' /hk' > 1, we have /(ct,V) = {a^jk'fi^fk' - 1) > 0, 
/(I) = 6(1 - dfy) < 0, fix) < for Va; < 1, and f{x) > for Va; > aff^,. 
Therefore, the real roots of this equation must be between 1 and afj^,. Recall 
the fact that an odd-order equation has at least one real root, and the sum of 
all three roots of this equation equals —a — 1, the equation must have only one 
real root af^., £ {l,afi^,). Because f{(yfy) = —dQp/dafy, based on the signs of 
fix) near x = afy , we know that ufy is a local maximizer. 

ii) Whenaf,, = c,rak'/b,k' < 1, we have /(a2,,) = (a2,,)2(a2,,-l) < 0, /(I) = 
^(1 ~ ^fk') > 0> fi^) > V.T > 1, and fix) < for Vx < af^^,. Therefore, 
the real roots of this equation must be between aff^, and 1. By factorization, we 
have 

fix) — x^ + ax^ + bx + c = ix — xi)ix^ + (a + xi)x + b + axi + x^) 

where xi is a root of fix) = 0. Thus, if we use a bisection search to find the 
first root Xi, the other two (real or complex) roots of fix) = are 



X2,3 = |^-(a + xi) ± Y -Sxf - 2axi + a'^ - 4b j /2. 

If there is more than one real root, we choose the one maximizing Qp as the 
new estimate aff., . 

Appendix C: Simulation 

C.l. Comparison of the two regularization schemes 

We investigated the performance of the two regularization schemes for the vari- 
ance parameters for set-up 3 in simulation case I. There were 36 (or 5) out of 



B. Xie et al. /Penalized model-based clustering 



204 



Table 8 

The mean numbers of the genes ( or variables ) whose penalized variance parameters were 
exactly one by the two regularization schemes, averaged over the datasets with g = 2 and 
g = 3 respectively. "Overlap" gives the common genes between the two regularization 
schemes or between/among the two/three clusters 







9 = 2 




9 = 


3 








Cluster 1 


Cluster 2 


Overlap 


Cluster 1 


Cluster 2 


Cluster 3 


Overlap 




-Li(var-l) 


281.0 


283.2 


278.0 


288.0 


290.0 


286.0 


286.0 


3 


-Ll(logvar) 


277.3 


279.2 


272.5 


288.0 


290.0 


286.0 


286.0 




Overlap 


276.8 


278.9 


271.9 


288.0 


290.0 


286.0 


286.0 



100 datsets which were identified with 2 (or 3) clusters by both (var-1) and 
log(var) methods. Table 8 summarizes the numbers of the genes with their pe- 
nalized variance estimates as exactly one by either regularization scheme. For 
g = 3, the two schemes gave exactly the same number of the genes for each 
cluster and discovered the same genes with their variances estimated as one 
across all 3 clusters. For g = 2, the results of the two schemes were also similar, 
though scheme one (i.e. penalizing var-1) identified slightly more genes with 
their variances estimated to be one for each cluster and more genes across both 
the clusters than did scheme two. 

Figure 5 compares the variance parameter estimates by the two regulariza- 
tion schemes and the sample variance estimates based on the estimated sample 
assignments for the estimated .9 = 2 clusters for one simulated dataset in set-up 
3. Due to the construction of the simulation data and standardization, the true 
cluster with 80 samples always had sample variances smaller than 1 for infor- 
mative variables, while the other cluster with 20 samples always had sample 
variances larger than 1 for those informative variables. Compared to the sam- 
ple variance estimates, the penalized estimates from both schemes were clearly 
shrunken towards 1, and could be exactly 1. Between the two schemes, they gave 
similar estimates for cluster 2, but scheme 1 in general shrank many variance 
parameters more than scheme 2, which was in agreement with and explained 
the results in Table 8. 

Appendix D: Golub's data 

D.l. Comparison of the two regularization schemes 

Figure 6 compares the MPLEs of the variance parameters given by the two 
regularization schemes for Golub's data with the top 2000 genes. Although the 
two schemes in general gave similar MPLEs, scheme 1 seemed to shrink more 
than did scheme 2, especially if cr^ > 1. 

Figures 7-8 compare the MPLEs from the two schemes with the sample 
variances based on the final clusters. The effects of shrinkage to and thresholding 
at 1 by the two regularization schemes were striking. In particular, there was a 
clear thresholding in MPLE when the sample variances were less than and close 
to 1 for scheme 1 (Figure 7) . To provide an explanation, we examined expression 
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a) Scheme 1 vs scheme 2 b) Scheme 1 vs scheme 2 




c) Scheme 1 d) Scheme 1 




0.5 1.0 1.5 2.0 2.5 3.0 0.5 0.6 0.7 0.8 0.9 1.0 1.1 

Sample variance Sample variance 

e) Scheme 2 f) Scheme 2 




0.5 1.0 1.5 2.0 2.5 3.0 0.5 0.6 0.7 0.8 0.9 1.0 1.1 

Sample variance Sample variance 



Fig 5. Comparison of the two regularization schemes on the variance parameters for one 
dataset of set-up 3. is MPLE for cluster i by scheme s. 



(15) given in the paper. We notice that if af/^ (in the form of the usual MLE) 
is less than 1, and A2 is large enough, then the MPLE aff. < a1^/{\/2 + 0) = 
2dff. < 2(1 - \2/bi) < 1. Therefore, a^^ did have a ceiling at 2(1 - X2/h)- 



D.2. Comparison with Kim et al. (2006)'s method 

Wc applied our penalized clustering methods to the Golub's data that were 
pre-processed as in Kim et al. (2006) [2.'j], resulting in 3571 genes; see Table 9. 
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a) Variance estimates of cluster 1 b) Variance estimates of cluster 2 




c) Variance estimates of cluster 3 d) Variance estimates of cluster 4 




Fig 6. Comparison of the two regularization schemes on the variance parameters for Golub's 
data with the top 2000 genes. X-axis and y-axis give the MPLEs by scheme 1 and scheme 2 
respectively. 

The standard methods without variable selection under-selected the number of 
clusters at 2, failing to distinguish between ALL-T and ALL-B, even between 
ALL and AML (for the equal covariance model), in agreement with our simu- 
lation results. Our proposed new penalized method could largely separate the 
AML samples and the two ALL subtypes; only two samples were mis-assigned. 
In contrast, Kim et aVs method could not separate the two subtypes of ALL 
samples. 
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a) Variance estimates of cluster 1 b) Variance estimates of cluster 2 
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Fig 7. Comparison of the penalized variance estimates by regularization scheme 1 and the 
sample variances for Golub 's data with the top 2000 genes. 



Table 9 

Clustering results for Golub's data with 3571 genes. The number of components (g) was 

selected by BIC 



Methods 




UnequalCov 


(Ai.Aa) 










EqualCov 


(Al) 
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(Ai.Aa) 






(0) 






(Al) 






BIC 
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a) Variance estimates of cluster 1 b) Variance estimates of cluster 2 
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Fig 8. Comparison of the penalized variance estimates by regularization scheme 2 and the 
sample variances for Golub 's data with the top 2000 genes. 
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